A Hybrid Approach for Entity Extraction in Code-Mixed Social Media Data

Authors

  • Deepak Kumar Gupta
  • Shweta
  • Shubham Tripathi
  • Asif Ekbal
  • Pushpak Bhattacharyya
Abstract

Entity extraction is an important task in many natural language processing (NLP) application areas. There has been a significant amount of work on entity extraction, but mostly for a few languages (such as English, some European languages and a few Asian languages) and for domains such as newswire. Social media have become a convenient and powerful way to express one’s opinion and sentiment. India is a diverse country with many linguistic and cultural variations. Texts written on social media are informal in nature, and people often use more than one script while writing. User-generated content such as tweets, blogs and personal websites is written in Roman script, or sometimes in both Roman and indigenous scripts. Entity extraction is, in general, more challenging for such informal text, and code mixing further complicates the process. In this paper, we propose a hybrid approach for entity extraction from code-mixed language pairs such as English-Hindi and English-Tamil. We use a rich linguistic feature set to train a Conditional Random Field (CRF) classifier. The output of the classifier is post-processed with a carefully hand-crafted feature set. The proposed system achieves F-scores of 62.17% and 44.12% for the English-Hindi and English-Tamil language pairs, respectively. Our system attains the best F-score among all the systems submitted to the FIRE 2016 shared task for the English-Tamil language pair.

CCS Concepts •Computing methodologies → Natural Language Processing; •Information System → Information Extraction; •Algorithm → Conditional Random Field (CRF);
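The abstract describes two stages: token-level features fed to a CRF classifier, followed by a hand-crafted post-processing pass. The sketch below illustrates the general shape of such a pipeline under stated assumptions. The feature names, the `B-PERSON` label, and the @-mention rule are hypothetical illustrations, not the features or rules used by the authors, and the CRF training step itself is omitted.

```python
import re


def token_features(tokens, i):
    """Build a feature dict for tokens[i].

    The features here are illustrative assumptions of the kind commonly
    given to a CRF tagger; they are not taken from the paper.
    """
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_hashtag": tok.startswith("#"),
        "is_mention": tok.startswith("@"),
        "has_digit": any(c.isdigit() for c in tok),
        # Crude script check: True when every character is ASCII (Roman script).
        "is_roman": all(ord(c) < 128 for c in tok),
        # Context features from the neighbouring tokens.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Feature dicts for every token in one tweet."""
    return [token_features(tokens, i) for i in range(len(tokens))]


def post_process(tokens, labels):
    """Hypothetical rule pass applied to classifier output.

    Any @mention that the classifier left as 'O' is relabelled 'B-PERSON'.
    """
    fixed = list(labels)
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        if lab == "O" and re.fullmatch(r"@\w+", tok):
            fixed[i] = "B-PERSON"
    return fixed


if __name__ == "__main__":
    tweet = ["@rahul", "Delhi", "mein", "hai"]
    predicted = ["O", "B-LOCATION", "O", "O"]  # stand-in for CRF output
    print(sentence_features(tweet)[1])
    print(post_process(tweet, predicted))
```

In a full system, the feature dicts would be passed to a CRF implementation for training and prediction, and the rule pass would then run over the predicted label sequence.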


Similar resources

Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach

Automating Named Entity Recognition in social media text has received a lot of attention over the past few years. Named entities are real-world objects such as Person, Organization, Product and Location. Identifying these entities in social media text is an important and challenging task due to the informal nature of text on social media. One such challenge that is faced in recognizing...

Conditional Random Fields for Code Mixed Entity Recognition

Entity Recognition is an essential part of Information Extraction, where explicitly available information and relations are extracted from the entities within the text. A plethora of information is available on social media in the form of text, and its free-style representation makes mining information from it considerably more complex. This complexity is enhanced more by r...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and a hybrid of the two. Machine-learning-based methods perform well for the Persian language if they are trained with good features. To get good performanc...

Sentiment Identification in Code-Mixed Social Media Text

Sentiment analysis is the Natural Language Processing (NLP) task dealing with the detection and classification of sentiments in texts. While some tasks deal with identifying the presence of sentiment in text (subjectivity analysis), other tasks aim at determining the polarity of the text, categorizing it as positive, negative or neutral. Whenever there is presence of sentiment in text, it has a s...

AMRITA_CEN@FIRE 2016: Code-Mix Entity Extraction for Hindi-English and Tamil-English Tweets

Social media text holds information regarding various important aspects. Extraction of such information serves as the basis for the most preliminary task in Natural Language Processing, called entity extraction. The work is submitted as a part of the Shared Task on Code Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. Three different meth...

Publication date: 2016